Abstract

Various techniques, used to optimise on-line principal component anal- 
ysis, are investigated by methods of statistical mechanics. These include 
local and global optimisation of node-dependent learning-rates which are 
shown to be very efficient in speeding up the learning process. They 
are investigated further for gaining insight into the learning rates' time- 
dependence, which is then employed for devising simple practical methods 
to improve training performance. Simulations demonstrate the benefit 
gained from using the new methods. 

1. Introduction 

The investigation of unsupervised on-line learning algorithms [?] by means
of statistical mechanics has been shown to be a useful tool for gaining insight into
the training dynamics [?]. In contrast to batch algorithms, in which all available
examples are considered simultaneously to calculate a single update of the student
parameters, on-line updates are carried out after the presentation of each single
data point (for an overview of current on-line methods in neural networks see [?]).
This update is proportional to a learning rate $\eta$ that has to be smaller than a
critical value to make learning possible [?]. Successful learning is only possible if
the learning rate is relatively small, which in turn means that many
update steps are needed. Therefore, a relatively large rate is needed at the
beginning and a smaller one later on; perfect learning is only possible if $\eta \to 0$ at
late stages of the learning process. For practical problems there is only empirical
knowledge of how the learning rate has to evolve [?]. The use of variational
techniques [?] enables one to calculate the optimal learning rate evolution $\eta$
theoretically; however, these calculations require information about the task and
the input distribution which is usually unavailable. Nevertheless, insight gained
from the analysis about the optimal learning rate time-dependence may be used
to improve training in practical scenarios.

There are two main learning rate optimisation paradigms, which we discuss
here: local optimisation maximises the cost function loss (the reduction of the
cost function) at every time step, while global optimisation seeks to maximise
the cost function loss within a predetermined time window. Note that towards
the end of the time window the two methods coincide, and that a sufficiently
long time window should be considered for the system to converge.

2. General Framework 

The algorithm examined here is an on-line algorithm for principal component
analysis based on Sanger's rule [?]. It has already been discussed in detail for
constant learning rates $\eta_l$ [7]. We consider here $N$-dimensional data vectors $\xi$ drawn
independently from a Gaussian data distribution with $M$ relevant orthonormal
directions $\{B_i\}_{i=1,\ldots,M}$ ($M \ll N$, $B_i \cdot B_j = \delta_{ij}$). The correlation matrix $C = \langle \xi \xi^T \rangle$
of this distribution has the form

$$ C = I + \sum_{i=1}^{M} (b_i^2 + 2 b_i)\, B_i B_i^T , \qquad (1) $$

where $\{b_i\}$ are some positive parameters representing the specific task and $I$ is
the identity matrix.
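
As an illustration, data with the correlation matrix (1) can be generated by stretching isotropic Gaussian vectors along the relevant directions: $\xi = z + \sum_i b_i (B_i \cdot z) B_i$ with $z$ standard normal has second moment $I + \sum_i (b_i^2 + 2 b_i) B_i B_i^T$. The following Python sketch is not part of the original analysis; the helper names, the use of NumPy and the dimension $N = 50$ are assumptions made for this example only.

```python
import numpy as np

def make_directions(N, M, rng):
    """Return M orthonormal rows B[i] in R^N (QR of a random Gaussian matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((N, M)))
    return q.T  # shape (M, N), rows are orthonormal

def correlation_matrix(B, b):
    """C = I + sum_i (b_i^2 + 2 b_i) B_i B_i^T, as in equation (1)."""
    N = B.shape[1]
    return np.eye(N) + (B.T * (b**2 + 2 * b)) @ B

def sample(B, b, n, rng):
    """Draw n vectors xi = z + sum_i b_i (B_i . z) B_i with z ~ N(0, I)."""
    z = rng.standard_normal((n, B.shape[1]))
    return z + ((z @ B.T) * b) @ B

# Illustrative values: b as used later in the text, N chosen arbitrarily.
rng = np.random.default_rng(0)
b = np.array([0.6, 0.4, 0.3])
B = make_directions(50, 3, rng)
C = correlation_matrix(B, b)
```

Along each relevant direction the variance is $(1 + b_i)^2$, and it is 1 in all orthogonal directions, so larger $b_i$ correspond to more dominant principal components.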

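In the on-line scenario a single vector $\xi^{\mu}$ is presented at every time step and a
set of student vectors $J_l \in \mathbb{R}^N$ ($l = 1, 2, \ldots, M$) is updated according to

$$ J_l^{\mu} = J_l^{\mu-1} + \frac{\eta_l}{N}\, x_l^{\mu} \left[ \xi^{\mu} - \sum_{k=1}^{l-1} x_k^{\mu} J_k^{\mu-1} \right] , \qquad (2) $$

with the student projections $x_l^{\mu} = J_l^{\mu-1} \cdot \xi^{\mu}$. The student vectors are normalized
explicitly after each time step.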
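
A single training step of this kind can be sketched in Python as follows. This is a minimal illustration, not the code used for the results reported here; the function names are hypothetical. It assumes that the sum in the update runs over the students preceding student $l$. Including the term $k = l$, as in the standard form of Sanger's rule, would only add a component parallel to $J_l$, which the explicit normalisation removes.

```python
import numpy as np

def sanger_step(J, xi, eta):
    """One on-line update of all students J (rows, shape (M, N)) with example xi.

    eta holds one learning rate per student. The sum over k runs over the
    students preceding l, and every row is renormalised to unit length.
    """
    N = J.shape[1]
    x = J @ xi                                  # projections x_l = J_l . xi
    J_new = np.empty_like(J)
    for l in range(J.shape[0]):
        deflated = xi - x[:l] @ J[:l]           # xi - sum_{k<l} x_k J_k
        J_new[l] = J[l] + (eta[l] / N) * x[l] * deflated
        J_new[l] /= np.linalg.norm(J_new[l])    # explicit normalisation
    return J_new

def overlaps(J, B):
    """Order parameters R_lj = J_l . B_j and Q_lm = J_l . J_m."""
    return J @ B.T, J @ J.T
```

Calling `sanger_step` once per example and recording `overlaps(J, B)` every few steps gives a simulated trajectory of the student-teacher and student-student overlaps in terms of the rescaled time, i.e. the number of examples divided by $N$.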

In the limit $N \to \infty$ the evolution of the system can be described by a set
of coupled differential equations in 'time' $\alpha = \mu/N$ for the quantities $R_{lj}(\alpha) =
J_l^{\mu} \cdot B_j$ and $Q_{lm}(\alpha) = J_l^{\mu} \cdot J_m^{\mu}$, which describe the overlaps of the student
vectors with the unknown principal components and the mutual overlap:



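$$ \frac{dR_{lj}}{d\alpha} = \eta_l \langle x_l y_j \rangle - \left( \eta_l + \frac{\eta_l^2}{2} \right) \langle x_l^2 \rangle R_{lj} - \eta_l \sum_{k=1}^{l-1} \langle x_l x_k \rangle \left( R_{kj} - Q_{kl} R_{lj} \right) \qquad (j, l = 1, 2, \ldots, M) \qquad (3) $$

$$ \frac{dQ_{lm}}{d\alpha} = (\eta_l + \eta_m + \eta_l \eta_m) \langle x_l x_m \rangle - \left[ \left( \eta_l + \frac{\eta_l^2}{2} \right) \langle x_l^2 \rangle + \left( \eta_m + \frac{\eta_m^2}{2} \right) \langle x_m^2 \rangle \right] Q_{lm} $$
$$ \qquad\qquad - \eta_l \sum_{k=1}^{l-1} \langle x_l x_k \rangle \left( Q_{km} - Q_{kl} Q_{lm} \right) - \eta_m \sum_{k=1}^{m-1} \langle x_m x_k \rangle \left( Q_{lk} - Q_{km} Q_{lm} \right) \qquad (l \neq m) $$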

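The averages over the quantities $x_k = J_k \cdot \xi$ and $y_j = B_j \cdot \xi$ can be performed
analytically, yielding

$$ \langle x_k y_j \rangle = (1 + b_j)^2 R_{kj}, \qquad \langle y_i y_j \rangle = (1 + b_j)^2 \delta_{ij}, \qquad \text{and} \qquad (4) $$

$$ \langle x_k x_l \rangle = Q_{kl} + \sum_{i=1}^{M} (b_i^2 + 2 b_i) R_{ki} R_{li} . $$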
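
These averages follow directly from the correlation matrix (1), since $\langle x_k y_j \rangle = J_k^T C B_j$, $\langle y_i y_j \rangle = B_i^T C B_j$ and $\langle x_k x_l \rangle = J_k^T C J_l$. A short Python sketch, with a hypothetical function name and NumPy as an assumed dependency, evaluates them from the order parameters alone:

```python
import numpy as np

def averages(R, Q, b):
    """Return <x_k y_j>, <y_i y_j> and <x_k x_l> as in equation (4).

    R has shape (M, M) with R[k, j] = J_k . B_j, Q has shape (M, M) with
    Q[k, l] = J_k . J_l, and b holds the M task parameters b_i.
    """
    xy = R * (1 + b) ** 2                # (1 + b_j)^2 R_kj
    yy = np.diag((1 + b) ** 2)           # (1 + b_j)^2 delta_ij
    A = (R * (b**2 + 2 * b)) @ R.T       # sum_i (b_i^2 + 2 b_i) R_ki R_li
    xx = Q + A
    return xy, yy, xx
```

Because only $R$, $Q$ and the parameters $b_i$ enter, the equations of motion close in terms of the order parameters and can be integrated numerically without reference to the $N$-dimensional vectors.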

An investigation of this learning scenario with constant learning rates showed 
that the entire process depends crucially on the learning rate. Learning rates 
have to be slightly different for each student vector to break symmetries which 
emerge between them during training and to avoid time-consuming plateaus [7].
The values have to be chosen between large learning rates which are suboptimal 
asymptotically and small learning rates that result in prohibitively slow learning 
at the transient. 

To improve learning performance and speed it is necessary to choose
time-dependent learning rates. As it has already been shown that different learning rates
for different nodes are important [7], we focus on finding appropriate solutions
for node-dependent learning rates $\eta_l(\alpha)$.

3. Locally Optimised Learning Rate 

One way to calculate an optimised learning rate is to maximise the cost function
loss in every time step (local optimisation), i.e., to obtain $\eta_l(\alpha)$ from a
minimisation of $d\varepsilon/d\alpha$ [?]:

$$ \frac{\partial}{\partial \eta_l} \frac{d\varepsilon}{d\alpha} = 0 . \qquad (5) $$

We choose the cost function

$$ \varepsilon = 1 - \frac{1}{M} \sum_{i=1}^{M} R_{ii} . \qquad (6) $$

This function is a measure of the learning success on a scale between 1 and 0,
representing poor and optimal performance respectively, and may be used to
derive the locally optimal learning rate of the form

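$$ \eta_l(\alpha) = -1 + \frac{(1 + b_l)^2 R_{ll} - \sum_{k=1}^{l-1} (Q_{lk} + A_{lk}) (R_{kl} - Q_{kl} R_{ll})}{(1 + A_{ll})\, R_{ll}} , \qquad (7) $$

with

$$ A_{kl} = \sum_{i=1}^{M} (b_i^2 + 2 b_i) R_{ki} R_{li} . \qquad (8) $$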
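
The locally optimal rate can be evaluated numerically from the order parameters. The Python sketch below is an illustration under stated assumptions rather than the code used here: it takes the stationarity condition (5) applied to the equation of motion for $R_{ll}$, with a cost function linear in $R_{ll}$, which gives $\eta_l = -1 + [\langle x_l y_l \rangle - \sum_{k<l} \langle x_l x_k \rangle (R_{kl} - Q_{kl} R_{ll})] / (\langle x_l^2 \rangle R_{ll})$. The function name is hypothetical and student indices are 0-based.

```python
import numpy as np

def local_optimal_rates(R, Q, b):
    """Locally optimal eta_l for every student, from the order parameters.

    R[k, j] = J_k . B_j, Q[k, l] = J_k . J_l, b holds the task parameters.
    Assumes R[l, l] != 0 for every student l.
    """
    A = (R * (b**2 + 2 * b)) @ R.T       # A_kl = sum_i (b_i^2 + 2 b_i) R_ki R_li
    M = R.shape[0]
    eta = np.empty(M)
    for l in range(M):
        # coupling to the students k < l that precede student l
        coupling = sum((Q[l, k] + A[l, k]) * (R[k, l] - Q[k, l] * R[l, l])
                       for k in range(l))
        numerator = (1 + b[l]) ** 2 * R[l, l] - coupling
        eta[l] = -1 + numerator / ((1 + A[l, l]) * R[l, l])
    return eta
```

Two limits serve as a check. For small, mutually orthogonal initial overlaps the expression reduces to $b_l^2 + 2 b_l$, a constant set by the data structure, and for perfectly specialised students ($R = Q = I$) it vanishes, in line with the requirement $\eta \to 0$ at late times.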
Figure 1: Time-dependence of the learning rates of the first three students
calculated with local optimisation.

This learning rate depends on the data structure and the order parameters of 
the problem. By choosing these optimal learning rates, the principal components 
are learnt very fast and high performance can be achieved. In the following we 
choose a data distribution (1) with $b_1 = 0.6$, $b_2 = 0.4$ and $b_3 = 0.3$. Figure 1 shows
the evolution of the learning rates for the first three student vectors. They all
begin with a constant value which depends on the data structure and have a
decaying phase later on where the learning rate decays roughly as $1/\alpha$. In addition,
the learning rates show a 'dip' at the point where another student vector learns
the current, most-dominant, principal component direction. This behaviour is
explained by Figure 2, showing the overlaps $R$ of the students with the principal
components (upper curves) and their mutual overlap $Q$ (lower curves): The
principal components are learned one after another; all students try to learn the 
largest p.c. first, which results in a significant overlap Q with the first student. 
The orthogonalisation realised by the algorithm pushes the other students away 
from that direction to specialise on other directions related to the less dominant 
p.c. Once the direction of the p.c. has been identified, the related learning rate, 
of the specialised student vector, starts decaying. At the same time there is a 
significant overlap with the other student vectors that started learning the same 
direction; consequently, their learning rates are suppressed so as to prevent them 
from specialising on this p.c. any further and to facilitate the change in direction. 

Figure 3 shows the evolution of the cost function (6). Curve (a) represents
a learning scenario with reasonably chosen constant learning rates ($\eta_1 = 0.1$,
$\eta_2 = 0.108$ and $\eta_3 = 0.09$) balancing between training speed and asymptotic
performance. The hierarchical structure of the learning process can be noticed
here as the three students learn the different principal components one after the
other. The same learning process but with locally optimised learning rates is
shown in curve (b). The principal components are learned very fast, resulting in
very good asymptotic performance. The locally optimised learning rate clearly
provides improved performance with respect to every constant rate. However, it
depends on knowledge that is not available in practical situations and can
therefore only provide insight into the optimal evolution of $\eta_l$.

Figure 2: Overlap of the first three students with their principal components and
their mutual overlap, learning with the locally optimised learning rates shown
in Figure 1.

Figure 3: Cost functions for the detection of the first three components in on-line
PCA: the graph shows the learning process with constant learning rates (a), with
locally optimised (b) and with globally optimised (c) learning rates.

4. Globally Optimised Learning Rate 

As the learning process may comprise different phases, for which local
optimisation may result in sub-optimal global performance, we will also consider here a
different approach based on global optimisation [?] of the learning rate. This has
been shown to outperform local optimisation over a predetermined time window.
This method maximises the cost function loss over a fixed time window:

$$ \Delta\varepsilon = \int_{\alpha_0}^{\alpha_1} d\alpha \left( \frac{d\varepsilon}{d\alpha} - \lambda \cdot (\text{constraints}) \right) , \qquad (9) $$

where the constraints are the equations of motion (3), which have to be satisfied
at every point in time, and $\lambda$ are the related Lagrange multipliers. The time
window $\alpha_1 - \alpha_0$ has to be chosen beforehand. Applying a variational approach
with respect to the order parameters and their time derivatives leads to a set
of differential equations for the Lagrange multipliers, from which the globally
optimised learning rates can be derived. Clearly, as in any other method, a
minimal time window is required for the learning to converge.

Globally optimal parametrisation was shown to be much more efficient in
the case of plateaus in the learning process, where local optimisation leads to
indefinite trapping [?]. However, one has to keep in mind that global optimisation
only looks at the total loss $\Delta\varepsilon$, so that intermediate values of $\varepsilon$ can be much
worse than those obtained via local optimisation. In the case of on-line PCA it
turns out that, after the minimal time needed for the algorithm to converge,
the learning performance of locally and globally optimised learning is similar.
Figure 3 shows the evolution of the learning process in both cases.

One can notice that the principal components are found later than with local 
optimisation. This can be explained in the following way: If one component is 
found more accurately, the orthogonalisation process can push the other students 
much more efficiently out of that direction, so that learning the next p.c. becomes 
easier. Local optimisation does not rely on future gains and therefore chooses to 
carry on with the specialisation of student vectors, providing better intermediate 
performance. The features of the globally optimised learning rates are similar to 
those obtained via local optimisation. 

Global optimisation would have been useful in the case of plateaus; these 
emerge in the case of on-line PCA only for a single learning rate $\eta_l(\alpha) = \eta(\alpha)$ [?].
Therefore, in most cases, global optimisation will not have any advantage over 
local optimisation. 

Note that instead of calculating a globally optimised learning rate that leaves
the learning rule itself unchanged, one can also calculate a globally optimised
learning rule [11]. In our case this can only be calculated numerically and does
not provide additional information.



5. Discussion 

The use of locally optimised learning rates shows a significant improvement in 
the learning performance over fixed learning rates, but it depends on quantities 
that are not available in practical applications of on-line-PCA. Nevertheless, in- 
sight gained from the theoretical study may be useful for improving performance 
in practical cases. From the analysis it is clear that the learning rates have to be
constant first and should decay like $1/\alpha$ at later times, after specialisation has
taken place. The point where the learning rate schedule should be changed has to be
set through observables accessible in practical scenarios; typically, one could use
constant learning rates until the asymptotic regime is reached, identified through
the stationarity of the students. Our analysis provides a refined criterion which leads to
much faster learning: In figures 1 and 2 one notices that the decaying phase for
a certain student starts where the overlap with other students (usually with one in
particular) starts growing significantly. At this point the first student has
already learned the current most-dominant p.c. and becomes almost stationary
while other students (one in particular), which show significant correlation with 
the first student, should start moving to other directions, being pushed away 
by the orthogonalisation process. The first student has learned enough to sta- 
bilise and to begin its 'fine tuning' which corresponds to the stage of a decaying 
learning rate. The mutual overlaps are the only order parameters accessible in 
real applications; here they provide a practical criterion for the starting point of 
the learning rate decay phase, a criterion which was obtained directly from the 
analysis. 
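
The criterion can be sketched in Python as follows. This is an illustrative assumption about how the rule might be implemented, not the procedure used for the simulations reported here: the function names, the threshold value 0.3 and the particular decay form $\eta_0 \alpha_s / \alpha$ (continuous at the onset $\alpha_s$) are all choices made for this example.

```python
import numpy as np

def scheduled_rate(eta0, alpha, alpha_start):
    """Constant eta0 until alpha_start, then eta0 * alpha_start / alpha."""
    if alpha_start is None or alpha <= alpha_start:
        return eta0
    return eta0 * alpha_start / alpha

def update_decay_onsets(Q, alpha, alpha_start, threshold=0.3):
    """Record the decay onset of student l once some later student m > l
    shows a mutual overlap |Q_lm| above the (assumed) threshold.

    alpha_start is a list with one entry per student, None while the
    student is still in its constant-rate phase.
    """
    M = Q.shape[0]
    for l in range(M):
        if alpha_start[l] is None and alpha > 0:
            if any(abs(Q[l, m]) > threshold for m in range(l + 1, M)):
                alpha_start[l] = alpha
    return alpha_start
```

Only the mutual overlaps $Q$ enter, which can be computed from the student vectors during training. In this sketch the last student has no later student to trigger its decay, so a separate rule, for instance one based on stationarity, would be needed for it.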

6. Simulation 

Simulations of on-line principal component analysis were made to test
the usefulness of the criterion explained above. Figure 4 displays a scenario with
constant learning rates ($\eta_1 = 0.1$ and $\eta_2 = 0.09$), learning the same data
distribution as before. The graph shows the overlaps of the first two students with
the corresponding principal components $R_{ll}$ and their mutual overlap $Q$ as means
and variances of ten runs. The asymptotic regime is reached at the end of the
time scale; at this point one would normally commence the decay of the learning
rates. In comparison, we applied the rule suggested above, based on monitoring
the overlaps between student vectors, to the same task, as shown in figure 5. As
soon as the overlap between two students starts growing significantly the decay of
the first student commences; the decay for the next student commences according
to similar criteria. This corresponds directly to the observations of the optimised
learning process. We should point out that the starting value of the learning
rates can be chosen higher than those in the constant case since the decay starts
very early. This demonstrates the efficiency of the rule developed here, which is
applicable to practical scenarios.

Figure 4: Simulation of on-line PCA with constant learning rates: Overlaps of
the first two students with their PCs $R_{ll}$ and their mutual overlap $Q$.

7. Conclusion

A statistical mechanics approach to optimising on-line principal component
analysis provides insight into the learning process. The theoretically obtained
time-dependent optimal learning rates depend on quantities which are not accessible
in practical applications; however, examining the optimal learning scenarios led 
to the development of a practical technique for speeding up the training process 
on the basis of observables that can be easily monitored in practical scenarios. 
The new method has been demonstrated on a simple problem and was shown to 
improve the training performance considerably. 

Acknowledgments This work has been partially supported by the EU grant
CHRX-CT92-0063 and the British Council grant: British-German Academic Research
Collaboration Programme project 1037. DS also acknowledges support from the
Leverhulme Trust (F/250/K). ES and DS would like to thank Magnus Rattray for useful
discussions and suggestions.



Figure 5: Simulation of on-line PCA as in figure 4. The learning rates are now
chosen as time-dependent according to the suggested rule. A comparison with
figure 4 demonstrates clearly the efficiency of this method.